Skip to content

fix: MCP client auto reconnect when MCP session is not available#8682

Open
SlightDust wants to merge 2 commits into
AstrBotDevs:masterfrom
SlightDust:fix-MCP-reconnect-when-Session-terminated
Open

fix: MCP client auto reconnect when MCP session is not available#8682
SlightDust wants to merge 2 commits into
AstrBotDevs:masterfrom
SlightDust:fix-MCP-reconnect-when-Session-terminated

Conversation

@SlightDust

@SlightDust SlightDust commented Jun 8, 2026

Copy link
Copy Markdown

Modifications / 改动点

MCP服务端重启的话,Astrbot这边的调用会持续报404,所以让ai改了重试机制,在发现MCP不可用时尝试重连3次。
fix #8681

  • This is NOT a breaking change. / 这不是一个破坏性变更。

Screenshots or Test Results / 运行截图或测试结果

image

Checklist / 检查清单

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
    / 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。

  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
    / 我的更改经过了良好的测试,并已在上方提供了“验证步骤”和“运行截图”

  • 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.
    / 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到 requirements.txtpyproject.toml 文件相应位置。

  • 😮 My changes do not introduce malicious code.
    / 我的更改没有引入恶意代码。

Summary by Sourcery

Improve MCP client tool invocation resilience by broadening connection error detection and implementing a custom multi-attempt reconnect loop.

Bug Fixes:

  • Handle MCP server restarts by treating a wider range of transport and protocol errors (including 404 and session termination) as reconnectable conditions.
  • Retry MCP tool calls up to three times with reconnection attempts when the MCP session is missing or the connection breaks, instead of failing after a single closed-resource error.

@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. area:core The bug / feature is about astrbot's core, backend labels Jun 8, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the automatic reconnection logic in mcp_client.py by replacing the tenacity retry decorator with a custom retry loop that handles specific transport and connection errors. However, several critical issues were identified in the review: a syntax error was introduced where self.session.call_tool was omitted, top-level imports of optional dependencies (McpError and anyio) will cause import failures if they are not installed, and catching BaseException instead of Exception risks intercepting system-level exceptions like KeyboardInterrupt.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

stop_after_attempt,
wait_exponential,
)
from mcp.shared.exceptions import McpError

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The import of McpError from mcp.shared.exceptions is performed at the top level. However, mcp is an optional dependency in this codebase (as indicated by the conditional import block on lines 91-98). If mcp is not installed, importing mcp_client.py will fail immediately with a ModuleNotFoundError at startup.

Please remove this top-level import and instead import McpError dynamically or handle its import failure gracefully.

Comment on lines +111 to +123
_TRANSPORT_ERRORS: tuple[type[Exception], ...] = (
anyio.ClosedResourceError,
anyio.BrokenResourceError,
anyio.EndOfStream,
ConnectionError,
ConnectionResetError,
BrokenPipeError,
httpx.HTTPStatusError,
httpx.ConnectError,
httpx.ReadError,
httpx.RemoteProtocolError,
McpError,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Defining _TRANSPORT_ERRORS with anyio.ClosedResourceError and McpError at the module level will raise a NameError or ModuleNotFoundError if anyio or mcp are not installed (since they are optional dependencies imported conditionally).

To prevent this, we should construct _TRANSPORT_ERRORS dynamically by safely attempting to import these optional exceptions.

_transport_errors_list: list[type[Exception]] = [
    ConnectionError,
    ConnectionResetError,
    BrokenPipeError,
    httpx.HTTPStatusError,
    httpx.ConnectError,
    httpx.ReadError,
    httpx.RemoteProtocolError,
]

try:
    import anyio
    _transport_errors_list.extend([
        anyio.ClosedResourceError,
        anyio.BrokenResourceError,
        anyio.EndOfStream,
    ])
except ImportError:
    pass

try:
    from mcp.shared.exceptions import McpError
    _transport_errors_list.append(McpError)
except ImportError:
    pass

_TRANSPORT_ERRORS = tuple(_transport_errors_list)

Comment thread astrbot/core/agent/mcp_client.py Outdated
)


def _is_connection_error(exc: BaseException) -> bool:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Change the parameter type of _is_connection_error from BaseException to Exception to align with catching Exception.

def _is_connection_error(exc: Exception) -> bool:

reraise=True,
)
async def _call_with_retry():
last_exc: BaseException | None = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Change the type annotation of last_exc from BaseException to Exception to align with catching Exception instead of BaseException.

        last_exc: Exception | None = None

read_timeout_seconds=read_timeout_seconds,
)
except anyio.ClosedResourceError:
except BaseException as e:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Catching BaseException is generally discouraged because it intercepts system-initiating exceptions like KeyboardInterrupt, SystemExit, and asyncio.CancelledError. This can lead to unexpected behavior or prevent clean shutdowns/cancellations.

Since all connection and transport-related exceptions inherit from Exception, we should catch Exception instead.

            except Exception as e:

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues, and left some high level feedback:

  • In call_tool_with_reconnect, the line that actually invokes the tool (self.session.call_tool(...)) appears to have been removed, so return await ( will raise a syntax/runtime error; you likely want return await self.session.call_tool(...) as before with the new retry logic around it.
  • The new manual retry loop removes the previous exponential backoff and detailed tenacity logging; if repeated reconnects are expected in some deployments, consider adding a small delay or backoff between attempts and ensuring failures are logged in a way that preserves the previous observability.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `call_tool_with_reconnect`, the line that actually invokes the tool (`self.session.call_tool(...)`) appears to have been removed, so `return await (` will raise a syntax/runtime error; you likely want `return await self.session.call_tool(...)` as before with the new retry logic around it.
- The new manual retry loop removes the previous exponential backoff and detailed tenacity logging; if repeated reconnects are expected in some deployments, consider adding a small delay or backoff between attempts and ensuring failures are logged in a way that preserves the previous observability.

## Individual Comments

### Comment 1
<location path="astrbot/core/agent/mcp_client.py" line_range="666-669" />
<code_context>

             try:
-                return await self.session.call_tool(
+                return await (
                     name=tool_name,
                     arguments=arguments,
                     read_timeout_seconds=read_timeout_seconds,
                 )
-            except anyio.ClosedResourceError:
</code_context>
<issue_to_address>
**issue (bug_risk):** The call to `call_tool` is missing the target coroutine (`self.session.call_tool`), making this invalid and likely a runtime error.

Previously this block correctly called `self.session.call_tool(...)`, but now it just awaits a parenthesized list of keyword arguments, which is invalid syntax and will fail at runtime. Please restore the explicit call, e.g.

```python
return await self.session.call_tool(
    name=tool_name,
    arguments=arguments,
    read_timeout_seconds=read_timeout_seconds,
)
```
</issue_to_address>

### Comment 2
<location path="astrbot/core/agent/mcp_client.py" line_range="671" />
<code_context>
                     read_timeout_seconds=read_timeout_seconds,
                 )
-            except anyio.ClosedResourceError:
+            except BaseException as e:
+                last_exc = e
+
+                if not _is_connection_error(e):
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Catching `BaseException` may swallow cancellations and other critical signals; consider restricting to `Exception`.

This will also catch `asyncio.CancelledError`, `KeyboardInterrupt`, and `SystemExit`. In async code that can break task cancellation and shutdown, especially if this layer retries or suppresses the error. Limiting the handler to `Exception` and letting cancellations/termination signals propagate is safer:

```python
except Exception as e:
    ...
```

```suggestion
            except Exception as e:
```
</issue_to_address>

### Comment 3
<location path="astrbot/core/agent/mcp_client.py" line_range="125-133" />
<code_context>
+)
+
+# Keywords in exception messages that indicate a transport/connection problem.
+_CONNECTION_ERROR_KEYWORDS = (
+    "session terminated",
+    "terminated",
+    "closed resource",
+    "broken resource",
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The generic keyword "terminated" may falsely classify unrelated errors as connection issues.

This broad substring match risks treating unrelated lifecycle or business errors as transport failures, triggering unnecessary reconnects. Consider restricting matching to more specific phrases (e.g., variants of the existing "session terminated") or otherwise tightening the condition so it only captures genuine connection/transport errors.

```suggestion
# Keywords in exception messages that indicate a transport/connection problem.
# Note: avoid overly generic substrings (e.g. just "terminated") to prevent
# misclassifying unrelated lifecycle or business errors as connection issues.
_CONNECTION_ERROR_KEYWORDS = (
    "session terminated",
    "session was terminated",
    "closed resource",
    "broken resource",
    "connection closed",
    "connection reset",
)
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread astrbot/core/agent/mcp_client.py Outdated
read_timeout_seconds=read_timeout_seconds,
)
except anyio.ClosedResourceError:
except BaseException as e:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Catching BaseException may swallow cancellations and other critical signals; consider restricting to Exception.

This will also catch asyncio.CancelledError, KeyboardInterrupt, and SystemExit. In async code that can break task cancellation and shutdown, especially if this layer retries or suppresses the error. Limiting the handler to Exception and letting cancellations/termination signals propagate is safer:

except Exception as e:
    ...
Suggested change
except BaseException as e:
except Exception as e:

Comment on lines +125 to +133
# Keywords in exception messages that indicate a transport/connection problem.
_CONNECTION_ERROR_KEYWORDS = (
"session terminated",
"terminated",
"closed resource",
"broken resource",
"connection closed",
"connection reset",
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): The generic keyword "terminated" may falsely classify unrelated errors as connection issues.

This broad substring match risks treating unrelated lifecycle or business errors as transport failures, triggering unnecessary reconnects. Consider restricting matching to more specific phrases (e.g., variants of the existing "session terminated") or otherwise tightening the condition so it only captures genuine connection/transport errors.

Suggested change
# Keywords in exception messages that indicate a transport/connection problem.
_CONNECTION_ERROR_KEYWORDS = (
"session terminated",
"terminated",
"closed resource",
"broken resource",
"connection closed",
"connection reset",
)
# Keywords in exception messages that indicate a transport/connection problem.
# Note: avoid overly generic substrings (e.g. just "terminated") to prevent
# misclassifying unrelated lifecycle or business errors as connection issues.
_CONNECTION_ERROR_KEYWORDS = (
"session terminated",
"session was terminated",
"closed resource",
"broken resource",
"connection closed",
"connection reset",
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:core The bug / feature is about astrbot's core, backend size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]重启MCP服务器后,AstrBot一直报错Session terminated

1 participant